Exercise 8 – CART

library(mlr3verse)
library(rattle)
Hint: the libraries loaded above (mlr3verse, rattle) will be useful throughout this sheet.
Exercise 1: Splitting criteria
Consider the following dataset:
| \(i\) | \(x^{(i)}\) | \(y^{(i)}\) |
|---|---|---|
| 1 | 1.0 | 1.0 |
| 2 | 2.0 | 1.0 |
| 3 | 7.0 | 0.5 |
| 4 | 10.0 | 10.0 |
| 5 | 20.0 | 11.0 |
- Compute the first split point the CART algorithm would find for the dataset above (with pen and paper or in R, resp. Python). Use the mean squared error (MSE) to assess the empirical risk.
Solution
Pen-and-paper solution
We have only one split variable \(x\). We probe all binary splits, placing candidate thresholds midway between adjacent observed feature values (think about why this might help generalization).
For each threshold:
- Map observations to child nodes resulting from split at that threshold
- Compute optimal constant prediction and resulting loss in both child nodes
- Aggregate to obtain overall loss of split
- Choose optimal split w.r.t. empirical risk reduction.
There are two scenarios for aggregating the risk contributions of both candidate child nodes:
- Child node risk is the sum of pointwise losses, not accounting for the number of observations \(\rightsquigarrow\) aggregate via a simple average
- Child node risk is the average of pointwise losses, thus accounting for the number of observations \(\rightsquigarrow\) aggregate via a weighted average
Split points
Split candidates \(t \in \{1.5, 4.5, 8.5, 15\}\), leading to the following child nodes:
\[\begin{align*} \mathcal{N}_1(t) &= \{ (x,y) \in \mathcal{N}: x \leq t \}, ~~ \mathcal{N}_2(t) = \{ (x,y) \in \mathcal{N}: x > t \}. \end{align*}\]
Risk per split point
Calculate \(\mathcal{R}(\mathcal{N}, t)\) for each split point:
\(x \leq 1.5\)
- Predictions: \(c_1 = \tfrac{1}{1} \sum_{i = 1}^1 y^{(i)} = 1, ~~ c_2 = \tfrac{1}{4} \sum_{i = 2}^5 y^{(i)} = 5.625\)
- Risk left: \(\rho_{\text{MSE}}(\mathcal{N}_1, 1.5) = 0\)
- Risk right: \(\rho_{\text{MSE}}(\mathcal{N}_2, 1.5) = \tfrac{1}{4}((1 - 5.625)^2 + (0.5 - 5.625)^2 + (10 - 5.625)^2 + (11 - 5.625)^2) \approx 23.92\)
- Aggregate risk: \(\mathcal{R}(\mathcal{N}, 1.5) = \tfrac{|\mathcal{N}_1|}{|\mathcal{N}|} \rho_{\text{MSE}}(\mathcal{N}_1, 1.5) + \tfrac{|\mathcal{N}_2|}{|\mathcal{N}|} \rho_{\text{MSE}}(\mathcal{N}_2, 1.5) \approx \tfrac{1}{5} \cdot 0 + \tfrac{4}{5} \cdot 23.92 \approx 19.14\)
\(x \leq 4.5\) \(\rightsquigarrow\) \(\mathcal{R}(\mathcal{N}, 4.5) = 13.43\)
\(x \leq 8.5\) \(\rightsquigarrow\) \(\mathcal{R}(\mathcal{N}, 8.5) = 0.13\) \(\rightsquigarrow\) optimal
\(x \leq 15\) \(\rightsquigarrow\) \(\mathcal{R}(\mathcal{N}, 15) = 12.64\)
Code solution (in R, resp. Python): write a function that evaluates all split candidates by their MSE-based risk, then apply it to the dataset.
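The pen-and-paper search can be implemented in a few lines. A possible NumPy sketch (function and variable names are my own; the weighted aggregation matches the pen-and-paper solution above):

```python
import numpy as np

def best_split(x, y, weighted=True):
    """Evaluate all split candidates for a 1-D feature by MSE risk.

    Thresholds are placed midway between adjacent unique feature values.
    With weighted=True, child risks are aggregated proportionally to the
    child-node sizes (weighted average); otherwise, a simple average.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2
    risks = []
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        # np.var is the MSE around the node mean (the optimal constant)
        if weighted:
            risk = (left.size * left.var() + right.size * right.var()) / y.size
        else:
            risk = (left.var() + right.var()) / 2
        risks.append((t, risk))
    return min(risks, key=lambda r: r[1]), risks

x = [1.0, 2.0, 7.0, 10.0, 20.0]
y = [1.0, 1.0, 0.5, 10.0, 11.0]
(best_t, best_risk), all_risks = best_split(x, y)
print(best_t, round(best_risk, 2))  # optimal threshold 8.5, risk ~0.13
```

Here both aggregation schemes happen to agree on the optimal threshold, but in general the choice of scheme can affect the ranking of candidates.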
- Derive the optimal constant predictor for a node \(\mathcal{N}\) when minimizing the empirical risk under \(L2\) loss and explain why this is equivalent to minimizing variance impurity.
Solution
Constant \(L2\) risk minimizer for a node \(\mathcal{N}\): \[ \bar y = \operatorname{arg\,min}_c \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} (y^{(i)}- c)^2, \]
because the objective is a convex quadratic in \(c\), so the first-order condition yields the unique minimizer:
\[\begin{align*} \min_{c \in \mathbb{R}} \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} (y^{(i)}- c)^2 &\Longleftrightarrow \frac{\partial{}}{\partial c} \left( \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} (y^{(i)}- c)^2 \right) = 0 \\ &\Longleftrightarrow \frac{1}{|\mathcal{N}|} \frac{\partial{}}{\partial c} \left( \sum_{i = 1}^{|\mathcal{N}|} \left({y^{(i)}}^2 - 2 y^{(i)}c + c^2\right) \right) = 0 \\ &\Longleftrightarrow \left( \sum_{i = 1}^{|\mathcal{N}|} \left( -2 y^{(i)}+ 2c\right) \right) = 0 \\ &\Longleftrightarrow |\mathcal{N}| \cdot c = \sum_{i = 1}^{|\mathcal{N}|} y^{(i)}\\ &\Longrightarrow \hat c = \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} y^{(i)} = \bar y. \end{align*}\]
It’s easy to see that the mean prediction minimizes variance impurity.
- We just found that \[ \bar y = \operatorname{arg\,min}_{c \in \mathbb{R}} \frac{1}{|\mathcal{N}|} \sum_{i = 1}^{|\mathcal{N}|} (y^{(i)}- c)^2. \]
- Looking closely at this minimization objective reveals: evaluated at the sample mean \(c = \bar y\), it is exactly the (biased) sample variance of the node, i.e., its variance impurity.
- Therefore, predicting the sample mean minimizes both \(L2\) risk and variance impurity.
- The mean of a sample is precisely the value for which the sum of squared distances to all points is minimal (the “center” of the data).
- Constant mean prediction is equivalent to an intercept-only linear model (trained with \(L2\) loss) \(\rightsquigarrow\) regression trees with \(L2\) loss fit a piecewise constant function, i.e., an intercept-only linear model within each leaf.
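The derivation can be double-checked numerically on the Exercise 1 targets (a NumPy sketch; the grid of candidate constants is an arbitrary choice):

```python
import numpy as np

# Targets from the dataset in Exercise 1
y = np.array([1.0, 1.0, 0.5, 10.0, 11.0])

# Empirical L2 risk of a constant prediction c
def risk(c):
    return np.mean((y - c) ** 2)

# Evaluate the risk on a fine grid of candidate constants
cs = np.linspace(-5.0, 20.0, 2001)
risks = np.array([risk(c) for c in cs])
c_best = cs[np.argmin(risks)]

# The grid minimizer coincides with the sample mean, and the minimal
# risk equals the (biased) sample variance, i.e., the variance impurity
print(c_best, y.mean(), risks.min(), y.var())
```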
- Explain why performing further splits can never result in a higher overall risk with \(L2\) loss as a splitting criterion.
Solution
Within each child node, the mean of that child's observations is the risk-optimal constant prediction, so each child's summed squared error is at most the squared error those same observations contribute under the parent's constant prediction. Summing over both children, the overall risk after a split can never exceed the risk of the unsplit node, so a further split cannot make the tree worse w.r.t. training error.
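This can be illustrated on the Exercise 1 data (a NumPy sketch; the split candidates are those from Exercise 1):

```python
import numpy as np

x = np.array([1.0, 2.0, 7.0, 10.0, 20.0])
y = np.array([1.0, 1.0, 0.5, 10.0, 11.0])

# Total squared error of the parent node (constant mean prediction)
parent_sse = np.sum((y - y.mean()) ** 2)

for t in [1.5, 4.5, 8.5, 15.0]:
    left, right = y[x <= t], y[x > t]
    child_sse = (np.sum((left - left.mean()) ** 2)
                 + np.sum((right - right.mean()) ** 2))
    # Each child mean is L2-optimal within its node, so the summed SSE
    # after any split is at most the parent's SSE
    assert child_sse <= parent_sse
print("no split increases the training risk")
```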
Exercise 2: CART hyperparameters
In this exercise, we will have a look at two of the most important CART hyperparameters, i.e., design choices exogenous to training. Both minsplit and maxdepth influence the number of input space partitions the CART will perform.
- How do you expect the number of splits to affect the model fit and generalization performance?
Solution
Allowing for more splits will make the model more complex, thus—all else being equal—achieving a better data fit but also increasing the risk of overfitting.
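This effect can be sketched with scikit-learn (used here as a stand-in for rpart; the synthetic data and parameter values are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: noisy sine wave
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

train_mse, test_mse = [], []
for depth in (2, 4, 8, 16):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mse.append(np.mean((tree.predict(X_tr) - y_tr) ** 2))
    test_mse.append(np.mean((tree.predict(X_te) - y_te) ** 2))

# Training error can only decrease with depth; test error eventually
# flattens or rises again once the tree starts fitting noise
print([round(m, 3) for m in train_mse])
print([round(m, 3) for m in test_mse])
```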
Using mlr3, fit a regression tree learner (regr.rpart) to the bike_sharing task (omitting the date feature) for

- maxdepth \(\in \{2, 4, 8\}\) with minsplit \(= 2\)
- minsplit \(\in \{5, 1000, 10000\}\) with maxdepth \(= 20\)

What do you observe?
Solution
library(mlr3verse)
library(rattle)

task <- tsk("bike_sharing")
task$select(task$feature_names[task$feature_names != "date"])

# Vary maxdepth while keeping minsplit minimal
for (i in c(2, 4, 8)) {
  learner <- lrn("regr.rpart", minsplit = 2, maxdepth = i, minbucket = 1)
  learner$train(task)
  fancyRpartPlot(learner$model, caption = sprintf("maxdepth: %i", i))
}

# Vary minsplit while keeping maxdepth generous
for (i in c(5, 1000, 10000)) {
  learner <- lrn("regr.rpart", minsplit = i, maxdepth = 20, minbucket = 1)
  learner$train(task)
  fancyRpartPlot(learner$model, caption = sprintf("minsplit: %i", i))
}

Higher values of maxdepth and lower values of minsplit, respectively, produce more complex trees.
- Which of the two options should we use to control tree appearance?
Solution
Both hyperparameters can be used to control tree depth, and their effect depends on the properties of the data-generating process (e.g., requiring at least 100 observations in a node before splitting further may leave behind very pure or very impure nodes, depending on the data). Sometimes requirements like interpretability might steer our decision (e.g., a tree of depth 15 is probably hard to interpret). Usually, however, we will employ hyperparameter tuning to determine the values of both (and other) hyperparameters, deferring the decision from the practitioner to the data.
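For illustration, such a tuning step can be sketched with scikit-learn (again as a stand-in for the rpart learner; the grid values and synthetic data are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: noisy sine wave
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

# Jointly tune max_depth (~ maxdepth) and min_samples_split (~ minsplit)
# via cross-validated grid search
grid = {"max_depth": [2, 4, 8, 20], "min_samples_split": [2, 5, 50, 200]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_)
```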
Exercise 3: Impurity reduction
This exercise is rather involved and requires some knowledge of probability theory.
Main take-away (besides training proof-type questions): In constructing CART with minimal Gini impurity, we minimize the expected rate of misclassification across the training data.
We will now build some intuition for the Brier score / Gini impurity as a splitting criterion by showing that it equals the expected misclassification error (MCE) of the resulting node.
The fractions of the classes \(k=1,\ldots, g\) in node \(\mathcal{N}\) of a decision tree are \(\pi^{(\mathcal{N})}_1,\ldots,\pi^{(\mathcal{N})}_g\), where \[ \pi_k^{(\mathcal{N})}= \frac{1}{|\mathcal{N}|} \sum_{(x^{(i)},y^{(i)}) \in \mathcal{N}} [y^{(i)} = k]. \]
For an expression that holds in expectation over arbitrary data, we need to introduce stochasticity. Assume we replace the (deterministic) classification rule in node \(\mathcal{N}\) \[ \hat{k}~|~\mathcal{N}=\arg\max_k \pi_k^{(\mathcal{N})} \]
by a randomizing rule \[ \hat k \sim \text{Cat} \left(\pi^{(\mathcal{N})}_1,\ldots,\pi^{(\mathcal{N})}_g \right), \]
in which we draw the classes from the categorical distribution of their estimated probabilities (i.e., class \(k\) is predicted with probability \(\pi^{(\mathcal{N})}_k\)).
- Explain the difference between the deterministic and the randomized classification rule.
Solution
The deterministic rule always picks the class with the highest empirical frequency. With the randomizing rule, we draw from a categorical distribution parameterized by these same empirical frequencies, i.e., we predict the most frequent class with probability \(\gamma = \max_k \pi_k^{(\mathcal{N})}\). If we repeat this process many times, we predict this class a fraction \(\gamma\) of the time. In expectation we therefore still predict the most frequent class most often, but the rule is more nuanced: the magnitude of \(\gamma\) matters (the closer it is to 1, the more similar the two rules become).
- Using the randomized rule, compute the expected MCE in node \(\mathcal{N}\) that contains \(n\) random training samples. What do you notice?
Solution
In order to compute the expected MCE, we need some random variables (RV) because we want to make a statement that holds for arbitrary training data drawn from the data-generating process. More precisely, we define \(n \in \mathbb{N}\) i.i.d. RV \(Y^{(1)}, \dots, Y^{(n)}\) that are distributed according to the categorical distribution induced by the observed class frequencies:
\[ \mathbb{P}(Y^{(i)} = k| \mathcal{N}) = \pi_k^{(\mathcal{N})}\quad \forall i \in \{1, \ldots, n\}, \quad k \in \mathcal{Y}. \]
The label \(y^{(i)}\) of the \(i\)-th training observation is thus a realization of the corresponding RV \(Y^{(i)}\).
Since our new randomization rule is stochastic, the predictions for the training observations will also be realizations of RV that we denote by \(\hat{Y}^{(1)}, ..., \hat{Y}^{(n)}\). By design, they follow the same categorical distribution:
\[ \mathbb{P}(\hat Y^{(i)} = k| \mathcal{N}) = \pi_k^{(\mathcal{N})}\quad \forall i \in \{1, \ldots, n\}, \quad k \in \mathcal{Y}. \]
Then, we can define the MCE for a node with \(n = |\mathcal{N}|\) when the randomizing rule is used:
\[ \rho_{\text{MCE}}(\mathcal{N}) = \frac{1}{n} \sum\limits_{i=1}^n\left[ Y^{(i)} \neq \hat Y^{(i)}\right]. \]
Taking the expectation of this MCE leads to a statement about a node with arbitrary training data:
\[\begin{align*} \mathbb{E}_{Y^{(1)}, \dots, Y^{(n)}, \hat{Y}^{(1)}, \dots, \hat{Y}^{(n)}} \left(\rho_{\text{MCE}}(\mathcal{N}) \right) &= \mathbb{E}_{Y^{(1)}, \dots, Y^{(n)}, \hat{Y}^{(1)}, \dots, \hat{Y}^{(n)}} \left(\frac{1}{n} \sum\limits_{i=1}^n \left[Y^{(i)} \neq \hat{Y}^{(i)} \right] \right) \\ & = \frac{1}{n} \sum\limits_{i=1}^n \mathbb{E}_{Y^{(i)}, \hat{Y}^{(i)}} \left( \left[Y^{(i)} \neq \hat{Y}^{(i)} \right]\right) ~~ \text{i.i.d. assumption + linearity} \\ & = \frac{1}{n} \sum\limits_{i=1}^n \mathbb{E}_{Y^{(i)}}\left( \mathbb{E}_{\hat{Y}^{(i)}} \left( \left[Y^{(i)} \neq \hat{Y}^{(i)} \right] \right) \right) ~~ \text{Fubini's theorem} \\ & = \frac{1}{n} \sum\limits_{i=1}^n \mathbb{E}_{Y^{(i)}} \left( \sum_{k \in \mathcal{Y}} \pi_k^{(\mathcal{N})} \cdot \left[ Y^{(i)} \neq k \right] \right) \\ & = \frac{1}{n} \sum\limits_{i=1}^n \mathbb{E}_{Y^{(i)}} \left( 1 - \pi^{(\mathcal{N})}_{Y^{(i)}} \right) \\ & = \frac{1}{n} \sum\limits_{i=1}^n \sum\limits_{k=1}^g \pi_k^{(\mathcal{N})} \cdot \left(1 - \pi_k^{(\mathcal{N})}\right) \\ & = n \cdot \frac{1}{n} \sum\limits_{k=1}^g \pi_k^{(\mathcal{N})} \cdot \left(1 - \pi_k^{(\mathcal{N})}\right) \\ &= \sum\limits_{k=1}^g \pi_k^{(\mathcal{N})} \cdot \left(1 - \pi_k^{(\mathcal{N})}\right). \end{align*}\]
This is precisely the Gini impurity that CART uses as a splitting criterion (it coincides with the Brier score of the node's class-probability estimates \(\pi_k^{(\mathcal{N})}\)). Gini impurity can thus be viewed as the expected frequency with which the training samples will be misclassified in a given node \(\mathcal{N}\).
In other words, in constructing CART with minimal Gini impurity, we minimize the expected rate of misclassification across the training data.
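The identity can also be checked by simulation (a sketch; the class fractions are an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # example class fractions in a node
gini = np.sum(pi * (1 - pi))        # Gini impurity, = 1 - sum(pi_k^2)

# Draw true labels and randomized predictions i.i.d. from the same
# categorical distribution and estimate the expected misclassification rate
n = 200_000
y = rng.choice(len(pi), size=n, p=pi)
y_hat = rng.choice(len(pi), size=n, p=pi)
mce = np.mean(y != y_hat)
print(round(gini, 3), round(mce, 3))  # the two agree up to simulation noise
```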